Hive: Map Joins

Map join is used when one of the join tables is small enough to fit in the memory.Whenever, we apply join operation, the job will be assigned to a Map Reduce task which consists of two stages- a Map stage and a Reduce stage. A mapper’s job during Map Stage is to “read” the data from join tables and to “return” the ‘join key’ and ‘join value’ pair into an intermediate file. Further, in the shuffle stage, this intermediate file is then sorted and merged. The reducer’s job during reduce stage is to take this sorted result as input and complete the task of join.

https://cdn.edureka.co/blog/wp-content/uploads/2013/11/joins1.png

Map-side Join is similar to a join but all the task will be performed by the mapper alone. The Map-side Join will be mostly suitable for small tables to optimize the task.

How will the map-side join optimize the task?

Assume that we have two tables of which one of them is a small table. When we submit a map reduce task, a Map Reduce local task will be created before the original join Map Reduce task which will read data of the small table from HDFS and store it into an in-memory hash table. After reading, it serializes the in-memory hash table into a hash table file.

In the next stage, when the original join Map Reduce task is running, it moves the data in the hash table file to the Hadoop distributed cache, which populates these files to each mapper’s local disk. So all the mappers can load this persistent hash table file back into the memory and do the join work as before. The execution flow of the optimized map join is shown in the figure below. After optimization, the small table needs to be read just once. Also if multiple mappers are running on the same machine, the distributed cache only needs to push one copy of the hash table file to this machine.

Optimizing-task-with-map-side-join

Advantages of using map side join:

Map-side join helps in minimizing the cost that is incurred for sorting and merging in the shuffle and reduce stages.
Map-side join also helps in improving the performance of the task by decreasing the time to finish the task.

Disadvantages of Map-side join:
Map side join is adequate only when one of the tables on which you perform map-side join operation is small enough to fit into the memory. Hence it is not suitable to perform map-side join on the tables which are huge data in both of them.

Now, we will perform the Map-side join on these two datasets. Map-side joins can be enabled in two ways.

Map Join

By specifying the keyword, /*+ MAPJOIN(b) */ in the join statement.
By setting the following property to true. hive.auto.convert.join=true

For performing Map-side joins, there should be two files, one is of larger size and the other is of smaller size. As it is a Map-side join, the number of reducers will be set to 0 automatically.You can set the small file size by using the following property:

hive.mapjoin.smalltable.filesize=(default it will be 25MB)

Now, let us perform Map-side joins and join the two datasets based on their IDs.

select /*+ MAPJOIN(o) */ c.customer_id,c.customer_fname,o.order_customer_id
from customers c inner join orders o
on c.customer_id=o.order_id;

What is Sort Merge Bucket Join in Hive?

If your table is large you can not do map side join. SMB is a join performed on bucket tables if

Both table sorted on same column in same order.
Both tables bucket on the same columns
Both table have join condition columns.

While joining each mapper reads a bucket from the first table and the corresponding bucket from the second table and then a merge sort bucket join is performed. When the tables are large we can use Hive Sort Merge Bucket join. However, using the join columns, all join the columns are bucketed and sorted in SMB. Although, make sure in SMB join all tables should have the same number of buckets.

hive> set hive.enforce.bucketing=true;
hive> set hive.enforce.sorting=true;

Now the stage is set to perform SMB Map Join to optimize Hive joining. Again, make some changes in properties to perform SMB Map join.

hive>set hive.enforce.sortmergebucketmapjoin=false;
hive>set hive.auto.convert.sortmerge.join=true;
hive>set hive.optimize.bucketmapjoin = true;
hive>set hive.optimize.bucketmapjoin.sortedmerge = true;
hive>set hive.auto.convert.join=false; // if we do not do this, automatically Map-Side Join will happen

SELECT u.name,u.salary FROM buck_dept d INNER JOIN buck_emp u ON d.id = u.id;

Hive

Map Joins

No comments:

Post a Comment